Abstract

For a classification problem described by the joint density P(ω, x), models
of P(ω = ω'|x, x') (the "Bayesian similarity measure") have been shown to be
an optimal similarity measure for nearest neighbor classification. This paper
demonstrates several additional properties of that conditional distribution.
The paper first shows that we can reconstruct, up to class labels, the class posterior
distribution P(ω|x) given P(ω = ω'|x, x'), gives a procedure for recovering the
class labels, and gives an asymptotically Bayes-optimal classification procedure. It
also shows, given such an optimal similarity measure, how to construct a classifier
that outperforms the nearest neighbor classifier and achieves Bayes-optimal
classification rates. The paper then analyzes Bayesian similarity in a framework where
a classifier faces a number of related classification tasks (multitask learning) and
illustrates that reconstruction of the class posterior distribution is not possible in
general. Finally, the paper identifies a distinct class of classification problems
using P(ω = ω'|x, x') and shows that using P(ω = ω'|x, x') to solve those problems
is the Bayes-optimal solution.



1 Introduction 

Statistical models of similarity have become increasingly important in recent work on
information retrieval [?], case-based reasoning [5], pattern recognition [1], and
computer vision [8, 9]. Of particular interest is Bayesian similarity, a discriminatively
trained model of P(x and x' are in the same class|x, x'), which we will abbreviate as
P(same|x, x'). These models have been demonstrated to work well in a number of
pattern recognition and visual object recognition problems [1, 8, 9, 3].

It is easy to see that nearest neighbor classification using 1 − P(same|x, x')
minimizes the risk that the class labels for x and x' differ and therefore is optimal for
1-nearest neighbor classification [8, 9]. However, beyond that observation, there have
been several kinds of analysis of Bayesian similarity. The first, presented by Mahamud
[8, 9], is an analysis considering a single instance of a classification problem,
determined by a joint distribution P(ω, x) of class labels ω and feature vectors x. The
authors also argue for the existence of useful invariance properties of Bayesian
similarity functions when those functions have a specific form [9]. The second is an analysis
based on a hierarchical Bayesian framework presented by Breuel [3], which effectively
considers Bayesian similarity in the context of a distribution of related classification
tasks.

This paper analyzes the relationship between Bayesian similarity P(same|x, x')
and the class posterior distribution P(ω|x) in both the non-hierarchical and hierarchical
cases and uses those results to construct an asymptotically Bayes-optimal classification
procedure using Bayesian similarity. It also presents a new statistical model for the
kinds of discrimination tasks described in [9] and demonstrates that Bayesian similarity
is the Bayes-optimal solution for those tasks. The implications of these results for
applications of Bayesian similarity will be discussed at the end.

2 Bayesian Similarity 

Consider a classification problem in which feature vectors x ∈ X = ℝ^n and class
variables ω ∈ {1, ..., c} are jointly distributed according to some distribution P(x, ω).

Definition 1 Let P(x, ω) be the distribution for a classification problem. Given two
samples from this distribution, (x, ω) and (x', ω'), we define Bayesian similarity as the
probability P(ω = ω'|x, x'). When ω and ω' are clear from context, we will usually
denote this as P(same|x, x').

Let x_ω be a sample that has somehow been selected as a "prototype" for class ω. It is
natural to classify some unknown feature vector x using the rule:

$$D(x) = \arg\max_{\omega'} P(\omega = \omega' \mid x, x_{\omega'}) \tag{1}$$

That is, we classify the unknown feature vector x using the class associated with the
training example x_ω' that is most similar to it in the sense of Bayesian similarity.
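The prototype rule of Equation 1 can be sketched in a few lines. This is only an illustration: the callable `p_same`, the dictionary of prototypes, and the one-dimensional toy posterior `posterior0` are assumptions introduced here, not part of the original analysis.

```python
def classify_by_prototype(x, prototypes, p_same):
    """Equation 1: return the class whose prototype is most similar to x.

    prototypes: dict mapping class label -> prototype feature vector
    p_same:     callable (x, x_proto) -> estimate of P(same | x, x_proto)
    """
    return max(prototypes, key=lambda label: p_same(x, prototypes[label]))


# Illustrative assumption: a 1-D two-class problem with a known posterior.
def posterior0(x):
    """Assumed P(omega = 0 | x): class 0 dominates for small x."""
    return min(1.0, max(0.0, 1.0 - x))


def toy_p_same(x, x_prime):
    """P(same | x, x') for two independent samples of the toy problem."""
    p, q = posterior0(x), posterior0(x_prime)
    return p * q + (1.0 - p) * (1.0 - q)


# One prototype per class; both lie where the assumed posterior is 0 or 1.
prototypes = {0: 0.0, 1: 1.0}
print(classify_by_prototype(0.2, prototypes, toy_p_same))
print(classify_by_prototype(0.9, prototypes, toy_p_same))
```

In practice `p_same` would be a trained model rather than a closed-form expression; the closed form is used here only so the example is self-contained.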

Observe that Equation 1 is analogous to nearest neighbor classification if we use
d(x, x') = 1 − P(ω = ω'|x, x') as the distance function. Because P(ω = ω'|x, x') is,
by definition, the probability that x and x' have the same class label, 1 − P(ω = ω'|x, x')
is the probability of misclassifying x when it is assigned the label of x'; therefore,
nearest neighbor classification using d(x, x') = 1 − P(ω = ω'|x, x') is an optimal
nearest neighbor classifier.

Nearest neighbor classification using d(x, x') = 1 − P(ω = ω'|x, x') is not
necessarily Bayes-optimal; in fact, the asymptotic bounds on its performance are no better
than those known for traditional nearest neighbor methods [8, 9]. However, when x_ω
is an unambiguous prototype, that is, P(ω|x_ω) = 1, then classification with Bayesian
similarity is Bayes-optimal:

$$P(\omega = \omega' \mid x, x_{\omega'}) = \sum_{\omega} P(\omega \mid x)\, P(\omega \mid x_{\omega'}) = \sum_{\omega} P(\omega \mid x)\, \delta(\omega, \omega') = P(\omega' \mid x) \tag{2}$$

That is, in the case of unambiguous training examples, Equation 1 just reduces to
Bayes-optimal classification.



3 Relationship between P(same|x, x') and P(ω|x)

While we have seen some relationships between nearest neighbor classification and
Bayesian similarity in the previous section and in the literature [8, 9], the question
arises whether there are better ways of taking advantage of P(same|x, x') and whether
we can achieve Bayes-optimal classification using a Bayesian similarity framework.

Consider a two-class classification problem; that is, ω ∈ {0, 1}. Now, examine
the probability P(same|x, x); that is, the probability that two samples with the same
feature vector actually have the same class. This probability is not equal to 1 in general
because both of the class conditional densities P(x|ω = 0) and P(x|ω = 1) may be
nonzero at x. We obtain:

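$$P(\text{same} \mid x, x) = P(\omega = 0 \mid x)\,P(\omega = 0 \mid x) + P(\omega = 1 \mid x)\,P(\omega = 1 \mid x) \tag{3}$$

$$= \left(P(\omega = 0 \mid x)\right)^2 + \left(P(\omega = 1 \mid x)\right)^2 \tag{4}$$

$$= \left(P(\omega = 0 \mid x)\right)^2 + \left(1 - P(\omega = 0 \mid x)\right)^2 \tag{5}$$

We can solve this for P(ω = 0|x) up to a sign:

$$P(\omega = 0 \mid x) = \frac{1}{2} \pm \frac{1}{2}\sqrt{2\,P(\text{same} \mid x, x) - 1} \tag{6}$$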

Note that P(same|x, x) ∈ [1/2, 1], so this is well-defined and real. Given P(same|x, x'),
in particular, we have P(same|x, x), and from Equation 6 we see that we can
reconstruct P(ω = 0|x) up to a single choice of a sign at each point. Of course, while this
gives us a lot of information about P(ω|x), the unknown sign is crucial for
classification.
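As a quick numerical check of Equation 6, the following sketch computes both candidate values of P(ω = 0|x) from P(same|x, x). The function name `posterior_candidates` and the test value 0.8 are illustrative assumptions.

```python
import math


def posterior_candidates(p_same_xx):
    """Equation 6: the two candidate values of P(omega = 0 | x).

    p_same_xx is P(same | x, x), which must lie in [1/2, 1].
    """
    if not 0.5 <= p_same_xx <= 1.0:
        raise ValueError("P(same | x, x) must lie in [1/2, 1]")
    root = 0.5 * math.sqrt(2.0 * p_same_xx - 1.0)
    return 0.5 - root, 0.5 + root


# Round trip: start from an assumed posterior, form P(same | x, x), invert.
p0 = 0.8
s = p0 ** 2 + (1.0 - p0) ** 2          # Equation 5
low, high = posterior_candidates(s)
print(low, high)
```

The true value 0.8 is one of the two candidates; the other is its complement 0.2, which is exactly the sign ambiguity discussed above.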

Now consider the decision regions for the minimum error decision rule:
D_0 = {x | P(ω = 0|x) > 1/2} and D_1 = {x | P(ω = 0|x) < 1/2}. That is, given an unknown
feature vector x, we decide ω = 0 when x ∈ D_0 and ω = 1 when x ∈ D_1. If x is not
contained in either decision region, we can make an arbitrary choice between classes
0 and 1.

Now consider two points x and x'. Assume they both come from D_0. Then, for
some positive d and d', P(ω = 0|x) = 1/2 + d and P(ω = 0|x') = 1/2 + d'. Therefore,

$$P(\text{same} \mid x, x') = \left(\tfrac{1}{2} + d\right)\left(\tfrac{1}{2} + d'\right) + \left(\tfrac{1}{2} - d\right)\left(\tfrac{1}{2} - d'\right) \tag{7}$$

$$= \tfrac{1}{2} + 2dd' \tag{8}$$

If both come from D_1, the result is the same. If one comes from D_0 and the other
comes from D_1, then, for some positive d and d',

$$P(\text{same} \mid x, x') = \left(\tfrac{1}{2} + d\right)\left(\tfrac{1}{2} - d'\right) + \left(\tfrac{1}{2} - d\right)\left(\tfrac{1}{2} + d'\right) \tag{9}$$

$$= \tfrac{1}{2} - 2dd' \tag{10}$$

Since d and d' are both positive, this means that if x and x' are in the same decision
region, P(same|x, x') > 1/2, and otherwise P(same|x, x') < 1/2. Therefore, for any
two points x and x', we can decide whether they are in the same decision region by
checking whether P(same|x, x') > 1/2.
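The identities in Equations 8 and 10 are easy to verify numerically. The sketch below uses arbitrary illustrative offsets d = 0.3 and d' = 0.1; the helper names are assumptions introduced for this example.

```python
def p_same(p, q):
    """P(same | x, x') in the two-class case.

    p = P(omega = 0 | x) and q = P(omega = 0 | x').
    """
    return p * q + (1.0 - p) * (1.0 - q)


def same_region(p_same_value):
    """True if x and x' lie in the same Bayes decision region."""
    return p_same_value > 0.5


d, d_prime = 0.3, 0.1
both_in_d0 = p_same(0.5 + d, 0.5 + d_prime)   # Equation 8: 1/2 + 2 d d'
opposite = p_same(0.5 + d, 0.5 - d_prime)     # Equation 10: 1/2 - 2 d d'
print(both_in_d0, same_region(both_in_d0))
print(opposite, same_region(opposite))
```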

Using these two results, we can now state the following theorem: 

Theorem 1 We can reconstruct either P(ω|x) or 1 − P(ω|x) from P(same|x, x').

Proof. Compute the two possible values for P(ω = 0|x) using Equation 6. Pick
a point x at which P(ω = 0|x) ≠ 1/2, i.e., where P(same|x, x) ≠ 1/2. P(ω = 0|x)
is then either less than 1/2 or greater than 1/2. Arbitrarily pick one of these; this is a
choice of membership of x in D_0 or D_1. Use the constraint P(same|x, x') > 1/2 for points
in the same decision region to assign all other points to decision regions. Given the
decision regions and the values from Equation 6, we have reconstructed either P(ω|x)
or 1 − P(ω|x), depending on whether our arbitrary choice above was correct or not. ∎

This means that if we have an estimate of the Bayesian similarity function P(same|x, x'),
we have already identified the class posterior distribution up to a choice of two: P(ω|x)
(the correct class posterior distribution), and 1 − P(ω|x).
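The construction in the proof of Theorem 1 can be sketched for a finite set of points. Everything named here (`reconstruct_posterior`, the toy posterior table `true_post`) is an assumption made for illustration; the sketch also assumes the similarity function is known exactly rather than estimated.

```python
import math


def reconstruct_posterior(points, p_same):
    """Return a dict x -> P(omega = 0 | x), correct up to swapping the classes."""
    # Reference point whose posterior is not 1/2, i.e. P(same | x, x) > 1/2.
    ref = next((x for x in points if p_same(x, x) > 0.5), None)
    result = {}
    for x in points:
        # Magnitude of the deviation from 1/2, from Equation 6.
        root = 0.5 * math.sqrt(max(0.0, 2.0 * p_same(x, x) - 1.0))
        if ref is None:
            result[x] = 0.5          # every posterior equals 1/2
        else:
            # Arbitrary choice: the reference point is placed in D_0.
            sign = 1.0 if p_same(x, ref) >= 0.5 else -1.0
            result[x] = 0.5 + sign * root
    return result


# Toy problem with an assumed posterior table P(omega = 0 | x).
true_post = {0.0: 0.9, 1.0: 0.7, 2.0: 0.5, 3.0: 0.2}


def toy_p_same(x, x2):
    p, q = true_post[x], true_post[x2]
    return p * q + (1.0 - p) * (1.0 - q)


recovered = reconstruct_posterior(list(true_post), toy_p_same)
print(recovered)
```

Here the arbitrary choice happens to be correct because the reference point really lies in D_0; had it been in D_1, the function would return 1 − P(ω = 0|x) at every point instead.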

Once we have P(same|x, x'), training samples only serve to distinguish the two
possibilities for the reconstructed class posterior distributions. Since the prior
probability for either choice is 1/2, we can determine which of the two possibilities applies by
considering the ratio of the probability of the samples given the models. That is, if we
write P_A(ω|x) and P_B(ω|x) for the two possibilities, then we evaluate

$$r = \frac{\prod_i P_A(\omega_i, x_i)}{\prod_i P_B(\omega_i, x_i)} = \frac{\prod_i P_A(\omega_i \mid x_i)\,P(x_i)}{\prod_i P_B(\omega_i \mid x_i)\,P(x_i)} \tag{11}$$

If r > 1, then P_A is the more likely possibility; otherwise P_B is the more likely
possibility.
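A minimal sketch of the likelihood ratio test in Equation 11 for the two-class case, computed in the log domain for numerical stability. The function name, the clipping constant `eps`, and the toy samples are assumptions added for this example; the sketch relies on the fact that the second candidate is the complement of the first.

```python
import math


def log_likelihood_ratio(samples, post_a):
    """Return log r for Equation 11, where P_B(0 | x) = 1 - P_A(0 | x).

    samples: iterable of (label, x) with label in {0, 1}
    post_a:  callable x -> candidate P_A(omega = 0 | x)
    """
    eps = 1e-12  # assumption: clip to avoid log(0) for deterministic posteriors
    total = 0.0
    for label, x in samples:
        pa0 = min(1.0 - eps, max(eps, post_a(x)))
        pa = pa0 if label == 0 else 1.0 - pa0   # P_A(label | x)
        pb = 1.0 - pa                           # P_B(label | x), flipped candidate
        total += math.log(pa) - math.log(pb)    # P(x_i) cancels in the ratio
    return total


# Toy labeled samples and a candidate posterior (illustrative values only).
samples = [(0, 0.1), (0, 0.2), (1, 0.9)]
candidate = lambda x: 1.0 - x
log_r = log_likelihood_ratio(samples, candidate)
print("choose P_A" if log_r > 0 else "choose P_B")
```

A positive log ratio corresponds to r > 1, so the candidate passed in as `post_a` is kept; otherwise its complement is used.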

So, if we take this together, we have a Bayes-optimal classification procedure given
P(same|x, x') and a set of prototypes or samples (ω_i, x_i): first, we compute the two
possible values of P(ω|x) at each point using Equation 6, then we use Equation 7 to
assign those values to the two possible branches, and then finally use the prototypes to
identify which of the two branches is the more likely using Equation 11. Finally, we
classify using the reconstructed class posterior distribution P(ω|x).

The only purpose that training samples obtained in addition to the Bayesian
similarity function P(same|x, x') serve in this procedure is to determine which of the two
possible choices of the reconstructed P(ω|x) is the correct one. Asymptotically, the
above procedure for making the choice between the two possibilities can be seen to be
correct with probability one. Therefore, this classification procedure is asymptotically
Bayes-optimal.

Compare that with the proposed use of Bayesian similarity in a nearest neighbor
classification procedure. First, the approach described above is very different from a
nearest neighbor classifier, because it integrates information from all samples. Second,
given P(same|x, x') and labeled training examples, a nearest neighbor classifier using
Bayesian similarity is, even asymptotically, only guaranteed to come within
a factor of two of the Bayes-optimal error rate [8, 9], while the procedure described
above will almost always reach the Bayes-optimal error rate.

4 Multi-Class Case



The previous section showed that for one large class of classification problems (namely,
two-class classification problems), knowledge of the Bayesian similarity function is
essentially equivalent to knowledge of the class posterior distributions. That already
demonstrates that, given P(same|x, x'), 1-NN classification is not an admissible
classification procedure (i.e., there is a procedure that is uniformly better). However, while
it is not central to the main argument, it is an interesting question to ask whether that
approach generalizes to the multi-class case. Let us sketch the argument here without
giving a full formal proof.
As before,

$$P(\text{same} \mid x, x') = \sum_i P(\omega = i \mid x)\,P(\omega = i \mid x') \tag{12}$$

Now, assume that we are looking at c classes and n points x_j, and write p_ij = P(ω = i|x_j).
Also, write s_ij for P(same|x_i, x_j). Then, we have

$$s_{ij} = \sum_k p_{ki}\,p_{kj} \tag{13}$$

The s_ij are n(n − 1)/2 given quantities, and there are (c − 1)n unknown quantities p_ij.
We have enough equations to solve for the unknowns when n > 2c − 1, that is, when the
number of equations exceeds the number of unknowns.

Of course, as in the two-class case, given any solution p_ij, any permutation of class
labels remains a solution, and as before, this is expressed as an uncertainty of signs in
the system of equations given by Equation 13. But, as in the two-class case, there is
only a finite number of possibilities, and we can distinguish among them by computing
the likelihoods of the actual set of training samples for each of the different possible
solutions. Therefore, we see that, as in the two-class case, we can reconstruct the class
posterior distribution up to permutation. As before, any additional training examples or
prototypes we use merely serve to pick the most likely possibility among this finite set.
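Two parts of this sketch can be checked directly: the counting argument, and the fact that relabeling the classes leaves every similarity value in Equation 13 unchanged. The matrix `p` below is an arbitrary illustrative posterior table (rows are classes, columns are points), and the function names are assumptions.

```python
from itertools import permutations


def similarity_matrix(p):
    """Equation 13: s[i][j] = sum_k p[k][i] * p[k][j], p[k][j] = P(omega = k | x_j)."""
    c, n = len(p), len(p[0])
    return [[sum(p[k][i] * p[k][j] for k in range(c)) for j in range(n)]
            for i in range(n)]


def enough_equations(n, c):
    """Counting argument: n(n-1)/2 equations versus (c-1)n unknowns."""
    return n * (n - 1) / 2 > (c - 1) * n


p = [[0.7, 0.1, 0.2],   # each column sums to 1
     [0.2, 0.6, 0.3],
     [0.1, 0.3, 0.5]]
s = similarity_matrix(p)

# Largest change in any s_ij over all relabelings of the three classes.
max_dev = 0.0
for perm in permutations(range(3)):
    t = similarity_matrix([p[k] for k in perm])
    max_dev = max(max_dev, max(abs(s[i][j] - t[i][j])
                               for i in range(3) for j in range(3)))
print(max_dev, enough_equations(6, 3), enough_equations(5, 3))
```

The deviation is zero up to floating-point rounding, which is why the similarity values alone can never identify which class is which.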

5 Batched Hierarchical Bayesian Similarity 

In the previous sections, we have seen that knowledge of the Bayesian similarity
function P(same|x, x') is mostly equivalent to knowledge of the class posterior
distribution P(ω|x). In effect, Bayesian similarity is a suboptimal application of P(ω|x). This
raises the question of whether using Bayesian similarity for nearest neighbor
classification is of any use at all. Both this and the next section answer that question in the
affirmative. While Bayesian similarity is not useful for simple classification problems,
it is useful for hierarchical Bayesian problems and actually Bayes-optimal for certain
discrimination problems. In fact, all previous applications of Bayesian similarity in the
literature, including [9], are probably better analyzed in one of these two frameworks
than as simple classification problems.

One way of understanding learning similarity measures for nearest neighbor
classifiers is to think of the problem as learning a similarity measure for a collection of related
tasks. For example, in an OCR problem, a similarity function might generally be able
to evaluate the similarity of different character shapes to one another, but when applied
to a specific classification problem, the identity of individual characters is given by a
set of training examples. See [6, 10] for further information. The idea of a collection
of related classification problems can be formalized in its most general form as that of
hierarchical Bayesian methods. After describing hierarchical Bayesian classification,
we will return to its relationship with Bayesian similarity.

In a hierarchical Bayesian framework, we assume that the distribution governing
the classification problem is parameterized by some parameter vector θ, which is itself
distributed according to some prior P(θ). We write P_θ(x|ω) or, equivalently, P(x|ω, θ)
for the parameterized class conditional density. If we are just given individual samples
from such a hierarchical Bayesian model, the model is merely a particular
representation of a non-hierarchical density using an integral [2]:

$$P(x \mid \omega) = \int P(x \mid \omega, \theta)\,P(\theta)\,d\theta \tag{14}$$



In a batched hierarchical Bayesian problem, a classifier faces a collection of batches,
where the samples (ω_i, x_i) within each batch are drawn using the same parameter θ.
The Bayes-optimal classification for a batch of samples B = {..., (ω_i, x_i), ...} can
be derived from the class conditional density for that batch:

$$P(x_1, \ldots, x_n \mid \omega_1, \ldots, \omega_n) = \int \prod_i P(x_i \mid \omega_i, \theta)\,P(\theta)\,d\theta \tag{15}$$

Note that this differs from a non-batched hierarchical Bayesian model, for which the
class conditional density for the same batch would be
P(x_1, ..., x_n|ω_1, ..., ω_n) = Π_i ∫ P(x_i|ω_i, θ) P(θ) dθ.

Let us now return to the question of how a hierarchical Bayesian approach relates
to Bayesian similarity. Trivially, we have

$$P(\omega = \omega' \mid x, x') = \int P(\omega = \omega' \mid x, x', \theta)\,P(\theta)\,d\theta \tag{16}$$

This function can be approximated by taking pairs of samples (ω, x) and (ω', x') from
the same batch (that is, drawn with the same θ) and training a classifier on them. We
refer to this as batched training. That is, it is learned analogously to Bayesian
similarity in the non-hierarchical cases, but all pairs of feature vectors x and x' used
for training are taken from the same batch.
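Batched training differs from the non-hierarchical case only in how training pairs are drawn. The following sketch shows one way to build such pairs; the function name `batched_pairs`, the data layout, and the toy batches are assumptions for illustration, and the resulting pairs would then be fed to whatever binary classifier is used to model P(same|x, x').

```python
import random


def batched_pairs(batches, num_pairs, rng):
    """Draw training pairs ((x, x'), same_label) with both samples from one batch.

    batches: list of batches, each a list of (label, x) samples that share
             the same (unobserved) parameter theta.
    """
    eligible = [b for b in batches if len(b) >= 2]
    pairs = []
    for _ in range(num_pairs):
        batch = rng.choice(eligible)              # pick one batch (one theta)
        (w, x), (w2, x2) = rng.sample(batch, 2)   # two distinct samples from it
        pairs.append(((x, x2), int(w == w2)))     # target: 1 if labels agree
    return pairs


# Two toy batches with clearly separated feature values (illustrative only).
batches = [[(0, 1.0), (1, 1.5), (0, 1.2)],
           [(0, 10.0), (1, 10.5)]]
pairs = batched_pairs(batches, 50, random.Random(0))
print(pairs[:3])
```

Drawing pairs across batches instead would train the non-hierarchical similarity, which is a different function in general.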

What is the equivalent to Equations 12 and 13? Those equations relied on the
relationship P(ω = ω'|x, x') = Σ_ω P(ω|x) P(ω|x'). But the equivalent relationship is not true
in the hierarchical Bayesian case. While P(ω = ω'|x, x', θ) = Σ_ω P(ω|x, θ) P(ω|x', θ),
the same is not true in general for the corresponding marginal distributions after
integration over θ: P(ω = ω'|x, x') ≠ Σ_ω P(ω|x) P(ω|x'). Therefore, given n sample points
x_1, ..., x_n, in general, we may have to estimate the values for all c^n combinations of
classifications P(ω_1, ..., ω_n|x_1, ..., x_n), and for that, the n(n − 1)/2 Bayesian
similarity values P(same|x_i, x_j) do not provide sufficient information in general.
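A small discrete example makes the failure of the factorization concrete. All numbers below are illustrative assumptions: two equally likely values of θ, two feature values, and a labeling that is reversed between the two values of θ, with a feature distribution that does not depend on θ.

```python
# Toy batched model (all numbers are illustrative assumptions).
P_THETA = {0: 0.5, 1: 0.5}
# POST[theta][x] = P(omega = 0 | x, theta); the labeling flips with theta.
POST = {0: {"a": 1.0, "b": 0.0},
        1: {"a": 0.0, "b": 1.0}}


def p_same_given_theta(x, x2, theta):
    """Within one batch the labels of x and x' are independent given theta."""
    p, q = POST[theta][x], POST[theta][x2]
    return p * q + (1.0 - p) * (1.0 - q)


def p_same_batched(x, x2):
    """Marginal Bayesian similarity for two samples from the same batch."""
    return sum(P_THETA[t] * p_same_given_theta(x, x2, t) for t in P_THETA)


def marginal_posterior0(x):
    """P(omega = 0 | x) after integrating out theta."""
    return sum(P_THETA[t] * POST[t][x] for t in P_THETA)


def p_same_factorized(x, x2):
    """What Equation 12 would give if it held for the marginals."""
    p, q = marginal_posterior0(x), marginal_posterior0(x2)
    return p * q + (1.0 - p) * (1.0 - q)


for pair in [("a", "a"), ("a", "b")]:
    print(pair, p_same_batched(*pair), p_same_factorized(*pair))
```

In this toy model the marginal posterior is uninformative (1/2 everywhere), yet the batched similarity still determines with certainty whether two feature values share a label, so the two quantities cannot be equivalent.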

Therefore, for hierarchical Bayesian classification, knowledge of the Bayesian
similarity is not, in general, equivalent to knowledge of the class posterior distributions.
However, even in a hierarchical Bayesian framework, it is still true that Bayesian
similarity is the optimal similarity function for nearest neighbor classification: for any
x and x', 1 − P(ω = ω'|x, x') = 1 − ∫ P(ω = ω'|x, x', θ) P(θ) dθ is the risk that the class labels
associated with x and x' differ, and minimizing that risk minimizes the overall risk of
misclassification in a 1-nearest neighbor framework (this is the analogous argument to
that made in [8, 9]). We can therefore state:

Theorem 2 Batch-trained Bayesian similarity is the optimal distance function for 1- 
nearest neighbor classification in a batched hierarchical Bayesian classification prob- 
lem. 

6 Discrimination Tasks 

In the previous section, we looked at a hierarchical Bayesian classification task. Let us 
now look at a closely related problem. 

Mahamud [8, 9] considers the problem of determining whether two image patches
in different images come from the same object or different objects. For this, the authors
train a Bayesian similarity model P(same|x, x') and use it to make this decision for real
images.

They analyze this by formulating it as a non-hierarchical classification problem
and postulate an underlying joint distribution P(ω, x) between class labels and feature
vectors. That presupposes that some class structure exists over the image patches; that
is, that image patches can be classified into a fixed set of categories and that the purpose
of nearest neighbor classification is to recover those categories. But the authors do not
demonstrate that such a class structure actually exists, and its existence does not appear
particularly plausible.

Consider, for example, feature vectors consisting of color histograms over image 
patches. While it is meaningful to ask whether two such color histograms are suffi- 
ciently similar between two images to have come from the same object, there is no 
obvious classification of color histograms that is independent of the specific problem 
instance. 

There are two different conditions, the same condition and the different condition.
Let us write S = 1 and S = 0 for the two conditions, respectively. Under the S = 1
condition, two unknown feature vectors x and x' are produced by the same patch,
parameterized as θ. Under the S = 0 condition, two unknown feature vectors are
produced by different patches, parameterized as θ and θ'. The task Mahamud [8, 9]
set out to solve is to decide whether a given pair of feature vectors x and x' was produced
under the S = 1 or S = 0 condition. In order to solve this problem, they postulate the
existence of an underlying classification problem P(ω, x) and then address it using
non-hierarchical Bayesian similarity. Their justification for using Bayesian similarity
is that ω is unobservable, so training a traditional classifier would be impossible.

If we don't invoke an underlying, unobservable class structure, how should we
analyze this kind of discrimination problem? Let us say that the possible surface patches
on a 3D object are parameterized by some parameter vector θ. Furthermore, let the
viewing parameters for that surface patch be given as φ, and let there be some random
noise variable ν. Then, the distribution of the feature vector representing the appearance
of the surface patch in the image, for unknown viewing parameters and noise, is given by

$$P(x \mid \theta) = \int P(x \mid \theta, \phi, \nu)\,P(\phi)\,P(\nu)\,d\phi\,d\nu \tag{17}$$

The problem is now to determine whether two samples x and x' come from the same
distribution P(x|θ).

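For concreteness, let us write down the distributions involved in this problem. The
class conditional density under the same condition is

$$P(x, x' \mid S = 1) = \int P(x \mid \theta)\,P(x' \mid \theta)\,P(\theta)\,d\theta \tag{18}$$

For the S = 0 condition, it is given by

$$P(x, x' \mid S = 0) = \int P(x \mid \theta)\,P(\theta)\,d\theta \int P(x' \mid \theta')\,P(\theta')\,d\theta' = P(x)\,P(x') \tag{19}$$

The joint distribution of x and x' is just the mixture:

$$P(x, x') = P(x, x' \mid S = 1)\,P(S = 1) + P(x, x' \mid S = 0)\,P(S = 0) \tag{20}$$

Applying Bayes rule gives us

$$P(S \mid x, x') = \frac{P(x, x' \mid S)\,P(S)}{P(x, x')} \tag{21}$$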

Nowhere in this derivation of the posterior distribution was it necessary to postulate
an underlying class structure. Furthermore, if we obtain a model of P(S|x, x') from
training data and use it for deciding whether x and x' were generated under S = 0 or
S = 1 conditions, our decision procedure will be Bayes-optimal because P(S|x, x')
is the optimal discriminant function for S. Suboptimality of the use of P(S|x, x') for
classification was a result of the fact that in classification, we are trying to make a
decision about ω, not S.
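For a discrete patch parameter, the posterior over S can be computed directly from the same-condition and different-condition densities. The sketch below assumes a finite set of θ values, a likelihood table, and a prior P(S = 1) = 1/2; all names and numbers are illustrative and not taken from the original experiments.

```python
def marginal(x, likelihood, prior_theta):
    """P(x) = sum over theta of P(x | theta) P(theta)."""
    return sum(prior_theta[t] * likelihood[t][x] for t in prior_theta)


def same_patch_posterior(x, x2, likelihood, prior_theta, prior_same=0.5):
    """P(S = 1 | x, x') for a discrete patch parameter theta.

    likelihood:  dict theta -> dict x -> P(x | theta)
    prior_theta: dict theta -> P(theta)
    """
    # Same condition (Equation 18): both observations share one theta.
    joint_same = sum(prior_theta[t] * likelihood[t][x] * likelihood[t][x2]
                     for t in prior_theta)
    # Different condition (Equation 19): independent thetas, so P(x) P(x').
    joint_diff = (marginal(x, likelihood, prior_theta)
                  * marginal(x2, likelihood, prior_theta))
    numerator = joint_same * prior_same
    return numerator / (numerator + joint_diff * (1.0 - prior_same))  # Bayes rule


# Two hypothetical patches that mostly produce "r" or "g" features.
likelihood = {"red_patch": {"r": 0.9, "g": 0.1},
              "green_patch": {"r": 0.1, "g": 0.9}}
prior_theta = {"red_patch": 0.5, "green_patch": 0.5}
post_rr = same_patch_posterior("r", "r", likelihood, prior_theta)
post_rg = same_patch_posterior("r", "g", likelihood, prior_theta)
print(post_rr, post_rg)
```

Thresholding this posterior at 1/2 gives the minimum-error decision between the two conditions, with no class labels involved anywhere in the computation.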



7 Discussion

In this paper, we have seen three distinct uses of Bayesian similarity: as a
similarity measure for non-hierarchical classification problems, as a similarity measure for
batched hierarchical classification problems, and as a similarity measure for
discrimination tasks.

The paper has shown that for non-hierarchical classification problems, models of
P(same|x, x') are equivalent to models of P(ω|x), up to permutation of the class
labels. That makes the use of Bayesian similarity for individual classification problems
merely a variation of learning a classifier. In a sense, P(same|x, x') is too problem
specific: it "knows so much" about the particular classification problem P(ω, x) that
we might as well use P(ω|x) directly. Although this paper did not show it formally,
that is likely to be a problem with any optimal similarity measure for nearest neighbor
classification.

Intuitively, what we would like is a similarity measure that works well across an
entire class of related problems. We can formalize this notion of a class of related
problems in a hierarchical Bayesian framework [2, 1, 6, 10]. When we consider Bayesian
similarity in such a framework, it is not equivalent to knowledge of the class posterior
distributions anymore. However, the property that it is an optimal similarity function
for nearest neighbor classification remains. This means that in a hierarchical Bayesian
setting, Bayesian similarity is a procedure that is distinct from other methods and may
have useful applications; unlike more direct or generative implementations of
hierarchical Bayesian models [6, 10], Bayesian similarity models appear to be easier to
implement and train. It is important to remember that such hierarchical models are trained
differently from the non-hierarchical models: for non-hierarchical models, samples x
and x' used for training P(same|x, x') are taken from the entire distribution, while for
hierarchical models, such samples are only taken from within a batch that was sampled
using the same distributional parameters θ.

Finally, the paper has presented a novel analysis of P(same|x, x') for
discrimination tasks like those considered in [8, 9] and demonstrated that the use of P(same|x, x')
in such tasks is, in fact, Bayes-optimal. This is an important result because those kinds
of discrimination tasks are quite common in computer vision applications.